Overview of Optimization Theory and the Four Main Types of Optimization Problems
Optimization theory is the branch of mathematics devoted to solving optimization problems: problems where we want to minimize or maximize the value of a mathematical function. These problems appear throughout computer science and applied mathematics. Finding good solutions to them is what gives data science and machine learning models their accuracy, since those models are built on numerical methods for minimizing error functions. Common optimization problems in machine learning include minimizing MSE, MAE, and cross-entropy.
In brief, optimization theory can be reduced to three questions:
Knowing what your objective function looks like in terms of number of dimensions, domain, and number of local extrema is vital when choosing an algorithm. In addition, it is necessary to know the data type of each variable: discrete or continuous? Lastly, knowing the set of constraints is important so that your algorithm searches only the feasible part of the domain.
If you remember from calculus, optima, otherwise known as critical points, are points in the domain where the slope of the function is zero. These points are commonly referred to as extrema, classified as either minima or maxima. Extrema can be classified into four main types:
Weak extrema are critical points whose neighboring points have the same function value, creating a plateau.
Strong extrema are critical points that have either the largest (for maximization) or smallest (for minimization) function value among their neighboring points, creating a valley or peak.
Global extrema are strong extrema whose function value is the largest (for maximization) or smallest (for minimization) among all strong extrema.
Saddle points (not discussed in this course) are inflection points between concave-up and concave-down trends.
There are four main types of optimization problems:
Unconstrained problems are just what they sound like: unconstrained. These are functions we want to minimize or maximize over a set of variables whose domains are continuous, discrete, or both. Here is the definition:
DEF: Unconstrained Problem:
- minimize $f(x)$, $x=(x_1,x_2,\dots,x_n)$;
- subject to $x_j\in domain(x_j)$.
- Where $n$ denotes number of variables.
Constrained problems are set up just like unconstrained problems but have constraints on the input space. Here is the definition:
DEF: Constrained Problem:
- minimize $f(x)$, $x=(x_1,x_2,\dots,x_n)$
- subject to $x_j\in domain(x_j)$, $g_m(x)\leq 0$, $m=1,\dots,n_g$. Where $n$ denotes number of variables, $g_m(x)$ denotes a constraint, and $n_g$ denotes the number of constraints.
Example
First, we have the objective function, denoted by A. The global unconstrained minimum is at the bottom right, at point E. However, we introduce an inequality constraint, denoted by line D, where any value in the checkered region is infeasible. Thus, we are left with only part of the domain to work with. Points B and C are two critical points in the feasible domain. The global minimum in the constrained environment is instead at point F. In scenarios such as these, we want our algorithm to steer away from the infeasible region and find the best feasible minimum.
An example of the interaction of the domain space between two variables in a constrained environment.
Here we can see that X and Y are continuous variables ranging between 0 and 2. However, we add a constraint such that as X tends toward 2, Y decreases; and as Y tends toward 2, X decreases. This can be seen in the blue semi-ring, which represents the feasible input space, while the transparent background represents the infeasible domain.
Multi-Solution Problems are functions where there are no unique global extrema; instead there are tied global solutions. In these instances we want to return all the tied global solutions (hence the name multi-solution) to give us a range of options to choose from.
In a real application, we might want to minimize some type of cost function, where we’ve found that there exists multiple solutions. Our goal is to return these multiple solutions so that we have the ability to choose which point or vector of variable values best fits our interest. Here we have an exact definition:
DEF: Multi Solution Problem: Find a set of solutions, $X=\{x_1,\dots,x_n\}$, such that each $x_i$ is a global extremum of the optimization problem.
Example
It is a variation of the sine wave function. As we can see, we have tied global minima and maxima for this problem. Note, however, that not all multi-solution problems are periodic like a sine wave, where it is easy to predict the next extremum; in general, the tied global extrema are scattered throughout the domain without pattern.
Multi-Objective Problems (MOP) are problems where we have many different objectives, or optimization problems, that we need to solve simultaneously. There are two main methods for solving these types of problems:
Weighted aggregation simply aggregates all the objective functions: we sum each objective function multiplied by an associated weight and minimize or maximize that sum. It is usually assumed that the weights sum to one. Here we have an example aggregate function where we want to simultaneously minimize F1 and F2. Note that each function has an associated weight, where a larger weight indicates more influence.
DEF: Multi Objective Problem (Weighted Aggregation):
- minimize $F(x)$, $F(x)=\sum_{k=1}^{n_k}w_kf_k(x)$
- subject to $x_j\in domain(x_j)$, $g_m(x)\leq 0$, $m=1,\dots,n_g$, $w_k\geq 0$, $k=1,\dots,n_k$
- Where $n$ denotes the number of variables, $g_m(x)$ denotes a constraint, $n_g$ denotes the number of constraints, $n_k$ denotes the number of objectives, and usually $\sum_{k=1}^{n_k}w_k=1$.
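The scalarization above can be sketched in a few lines. The two objectives and the weights below are invented for illustration, not taken from the text:

```python
import numpy as np

# Two hypothetical objectives to minimize simultaneously (invented for
# illustration): a sphere and a shifted sphere.
def f1(x):
    return float(np.sum(x**2))

def f2(x):
    return float(np.sum((x - 2.0)**2))

def weighted_aggregate(x, weights=(0.7, 0.3)):
    # Weights are non-negative and sum to one, as the definition assumes.
    w1, w2 = weights
    return w1 * f1(x) + w2 * f2(x)

x = np.array([1.0, 1.0])
print(weighted_aggregate(x))  # 2.0: a single scalar value to minimize
```

Once aggregated, any single-objective method (including the EAs below) can minimize the scalar $F(x)$ directly.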
Pareto Optimality is another way to solve MOP by classifying possible solutions in a dominance structure. Domination is a way to distinguish good solutions from bad. There are two main types of domination: Strong and Weak.
Strong domination occurs when a possible solution is equal to or better than another possible solution in all objectives, and strictly better in at least one objective.
DEF: Strong Domination (Pareto-Optimality): A decision vector $x_1$ strongly dominates another decision vector $x_2$ if and only if:
- $x_1$ is equal to or better than $x_2$ in all objectives: $f_k(x_1)\leq f_k(x_2) \forall k=1,\dots,n_k$
- $x_1$ is better than $x_2$ in at least one objective: $\exists k | f_k(x_1)<f_k(x_2)$
Weak domination on the other hand states that a possible solution is not worse than another solution in all objectives.
DEF: Weak Domination (Pareto-Optimality): A decision vector $x_1$ weakly dominates another decision vector $x_2$ if and only if:
- $x_1$ is better than $x_2$ in at least one objective: $\exists k | f_k(x_1)<f_k(x_2)$
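Both dominance checks follow directly from the definitions. A minimal sketch, assuming minimization and objective values already computed (the function names are my own):

```python
import numpy as np

def strongly_dominates(fx1, fx2):
    # x1 is no worse in every objective and strictly better in at least one
    # (fx1, fx2 are vectors of objective values, minimization assumed).
    fx1, fx2 = np.asarray(fx1), np.asarray(fx2)
    return bool(np.all(fx1 <= fx2) and np.any(fx1 < fx2))

def weakly_dominates(fx1, fx2):
    # x1 is strictly better in at least one objective.
    fx1, fx2 = np.asarray(fx1), np.asarray(fx2)
    return bool(np.any(fx1 < fx2))

print(strongly_dominates([1.0, 2.0], [1.0, 3.0]))  # True
print(strongly_dominates([1.0, 4.0], [1.0, 3.0]))  # False: worse in objective 2
```

The Pareto front is then the set of decision vectors that no other vector strongly dominates.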
Benchmark test functions are mathematical optimization problems designed for testing algorithms. They are artificial problems with no direct application, but are constructed to be as difficult as, or more difficult than, real-world applications. In the literature, instead of comparing algorithms on specific applications, which may require extensive domain knowledge, researchers often compare algorithms by how well they find global extrema on these benchmark functions. There are many types of benchmark test functions; here are three commonly found in the literature:
In summary, optimization problems can be classified into four categories: unconstrained problems, where the bounds are fixed and do not change; constrained problems, where the feasible domain changes depending on variable values; multi-solution problems, where there is no unique global minimum/maximum and we want to return all tied solutions; and multi-objective problems, where we want to find the Pareto front.
Optimization problems are abundant in numerous fields of research, from traffic scheduling, production maximization, and efficient electricity use to neural network training, game modeling, and tuning data science models. The goal in these circumstances is to minimize or maximize the objective function, where each optimum represents a possible solution. However, we do not want just any solution; we want the best solution, the global minimum or maximum. So how do we find these global extrema? We could use standard numerical methods such as Newton's method or conjugate gradient, or other optimization methods such as leapfrog, hill climbing, or simulated annealing. In this series we will use evolutionary computation as a means to find these global extrema.
Why would we want to use evolutionary computation in the first place over classical optimization techniques? Well, the problem with classical optimization algorithms is usually that they are
On the other hand evolutionary computation
For these reasons, evolutionary computation is commonly used on problems classified as NP-hard, such as traveling salesman or scheduling, or when the objective function is not differentiable.
Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding.
Here we have a picture of the famous finches Charles Darwin depicted in his journal while on the Galapagos Islands. He noted the extreme diversity in beak sizes within the species. This is an example of mutation.
Here we have a picture of a trivial example of natural selection, adaptation, and selective breeding.
In evolutionary computation, we model these principles to find the best solution to a problem. Each possible solution is represented as an individual in a population pool, where we perform adaptation, natural selection, and selective breeding on those possible solutions to ultimately find the best solution for the problem.
For a brief overview of the canonical genetic algorithm,
Cells are among the most basic building blocks of organisms. Each cell is built from strands of DNA. Chromosomes, the main structures of DNA, contain large strands of genes, where each gene represents some inherited characteristic. In EC, we model these characteristics in our optimization problems. An individual in EC is simply a point in the domain of the optimization problem, where each variable value is represented by a gene. Thus, each point is an individual, and each component of that point, i.e. each variable value, is encoded as a gene in the genome. The genotype describes the genetic makeup of an individual; the phenotype describes what the individual looks like in the environment.
Evolutionary algorithms (EAs) are population-based search algorithms: they take a pool of initial points and search from these points in parallel.
Unlike standard numerical methods such as Newton's method, which you feed only one initial value, EAs use the diversity of the population to search for better solutions. In addition, we need a way to represent the entire search space of the problem: if the initial population only includes values from a centrally located part of the search space, the algorithm will only find optima near that region.
Therefore, to ensure diversity, we sample uniformly at random from the domain. However, when dealing with constrained problems, where the bounds depend on variable values, uniform sampling may produce infeasible solutions. To get around this, we either hard-code the constraint into the sampling or sample over the entire input space and keep only the feasible solutions.
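The keep-only-feasible variant is simple rejection sampling. A minimal sketch, assuming a 2-D box domain and a hypothetical feasibility predicate (the unit-disk constraint below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_population(pop_size, lower, upper, dim=2, feasible=None):
    # Sample uniformly over the box [lower, upper]^dim; if a feasibility
    # predicate is given, keep only feasible individuals (rejection sampling).
    pop = []
    while len(pop) < pop_size:
        x = rng.uniform(lower, upper, size=dim)
        if feasible is None or feasible(x):
            pop.append(x)
    return np.array(pop)

# Hypothetical constraint: keep only points inside the unit disk.
pop = init_population(10, -1.0, 1.0, feasible=lambda x: np.sum(x**2) <= 1.0)
print(pop.shape)  # (10, 2)
```

Note that rejection sampling can be slow when the feasible region is a tiny fraction of the box; in that case hard-coding the constraint into the sampler is the better route.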
Lastly, the choice of the initial population size is crucial to the convergence and complexity of the algorithm.
The choice of size for the initial population is dependent upon the
In evolution, the motto is that only the best fit individuals in the environment survive. How do we determine which individuals are fit?
In EC the fitness function is the optimization problem itself, where larger function values mean better fitness. In the case of minimization, we can scale the function values so that smaller values yield larger fitness values. The choice of fitness function depends directly on the problem being optimized.
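One simple scaling for the minimization case (an assumption on my part; other schemes exist) subtracts each objective value from the worst one:

```python
import numpy as np

def fitness_from_min(values):
    # Subtract each objective value from the worst, so the smallest
    # objective value receives the largest fitness.
    values = np.asarray(values, dtype=float)
    return np.max(values) - values

obj = np.array([4.0, 1.0, 3.0])  # smaller is better
fit = fitness_from_min(obj)
print(fit)  # [0. 3. 1.] -- the best objective value now has the best fitness
```

This keeps fitness non-negative, which matters for proportional selection later on.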
In optimization theory there are four main types of problems: unconstrained, constrained, multi-objective, and multi-solution. EAs can work on all of these types but may have difficulty with two: constrained and multi-solution.
1) Constrained problems: Because the feasible domain is dynamic in a constrained environment, our algorithm may encounter infeasible solutions. There are two main ways around this.
DEF: Penalty Method: minimize $F(x)$ $$F(x)=f(x)+\lambda p(x)$$ where $p(x)$ denotes the penalty value of $x$, and $\lambda$ denotes a constant controlling the strength of the penalty.
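A sketch of the penalty method under an invented constraint $1-\sum_j x_j\leq 0$ (the objective, the constraint, and the value of $\lambda$ are all assumptions for illustration):

```python
import numpy as np

def f(x):
    return float(np.sum(x**2))  # objective to minimize (invented)

def p(x):
    # Penalty for the invented constraint g(x) = 1 - sum(x) <= 0:
    # zero when feasible, equal to the violation otherwise.
    return max(1.0 - float(np.sum(x)), 0.0)

def F(x, lam=100.0):
    # F(x) = f(x) + lambda * p(x), matching the definition above
    return f(x) + lam * p(x)

print(F(np.array([0.6, 0.6])))  # feasible: 0.72, no penalty added
print(F(np.array([0.1, 0.1])))  # infeasible: 0.02 + 100 * 0.8 = 80.02
```

Because the penalized value of the infeasible point is so much worse, selection naturally steers the population back toward the feasible region.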
2) Multi-solution problems: EAs tend to converge to a single point or a small group of points. In instances where many tied solutions need to be returned, EAs can struggle.
Once we have our initial population and compute the fitness of each individual, we need a way to determine who will be selected for survival and reproduction.
In EAs, there is a trade-off between exploitation and exploration. 1) Exploitation refers to the convergence quality of the algorithm. If we select an algorithm that always chooses the fittest individuals to survive, we exhibit high exploitation but low exploration. High exploitation leads to quick convergence, but it can also lead to premature convergence.
For example, in the picture above (assuming maximization), the global maximum is at B, but if that position has not been explored by our algorithm, it might prematurely converge to A or C, depending on how much of the input space has been explored. In the picture, however, the initial point lies between A and B, so it will converge to either A or B, depending on random chance in the algorithm.
2) In opposition to exploitation is exploration: this style of selection does not always choose the fittest individuals to mate, encouraging exploration of the input space. However, by encouraging exploration, the algorithm can have poor convergence properties.
In practice, it is sometimes best to utilize both, exploration in the beginning and then exploitation towards the end.
Selective pressure refers to how long it takes to produce a uniform population: high selective pressure decreases diversity, whereas low selective pressure encourages exploration.
There are many selection methods, but here are the most popular:
Random selection is a simple method that has good exploration qualities as it chooses the individuals to survive at random; however, because of this, it does not guarantee that the current best solution will survive, leading to possible solutions being lost. It is most commonly paired with elitism as we will see in a moment.
Proportional selection builds a probability distribution from the fitness values by dividing each fitness value by the sum of all fitness values. Because selection is probabilistic, it provides good diversity while still ensuring good convergence; however, since each probability is the proportion of an individual's fitness to the total, individuals with large fitness values early on can dominate the selection process.
To keep large fitness values from dominating the reproduction pool in proportional selection, rank-based selection has been proposed. It is like proportional selection except the probability of being chosen is based not on the raw fitness value but on the individual's rank. Individuals are ranked from worst to best, with the best given rank one; the selection proportion then decreases, either linearly or nonlinearly, for individuals with higher (worse) ranks. Above, we have an example of rank-based selection, except the best is ranked highest and the worst lowest. The exact ranking of an individual depends on which rank-based selection equation you use.
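One common linear-ranking scheme can be sketched as follows; here the worst individual receives rank 1 and the best rank $n$, matching the figure's convention where the best is ranked highest (the function name is my own):

```python
import numpy as np

def rank_probabilities(fitness):
    # Linear ranking: sort by fitness, give the worst rank 1 and the best
    # rank n, then normalize the ranks into selection probabilities.
    fitness = np.asarray(fitness)
    ranks = np.empty(len(fitness))
    ranks[np.argsort(fitness)] = np.arange(1, len(fitness) + 1)
    return ranks / ranks.sum()

fit = [10.0, 200.0, 30.0, 40.0]  # one outlier fitness value
print(rank_probabilities(fit))   # [0.1 0.4 0.2 0.3]
```

Under plain proportional selection the outlier would receive 200/280, about 0.71, of the probability mass; under ranking it gets only 0.4, so it no longer dominates.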
Tournament selection, on the other hand, randomly selects a group of individuals (the group size is known as the tournament size), and the individual with the best fitness in the tournament is chosen to survive. This ensures good convergence quality, since the best of the group is chosen, while the tournament size adds a diversity component. Note that a tournament size of 1 is essentially equivalent to random selection; smaller tournament sizes encourage diversity, while larger tournament sizes make it more likely the fittest individuals are chosen.
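A minimal sketch of tournament selection (the function name and population values are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def tournament_select(fitness, k=3):
    # Draw k distinct contenders at random; the fittest contender wins.
    fitness = np.asarray(fitness)
    contenders = rng.choice(len(fitness), size=k, replace=False)
    return int(contenders[np.argmax(fitness[contenders])])

fit = [1.0, 5.0, 2.0, 9.0, 3.0]
winners = [tournament_select(fit, k=5) for _ in range(3)]
print(winners)  # [3, 3, 3] -- with k equal to the population size, the best always wins
```

Shrinking `k` toward 1 recovers the random-selection extreme described above.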
Boltzmann selection is another form of selection where the probability of being chosen is based on a simulated-annealing equation. We will cover Boltzmann selection in the next unit when we discuss how to choose for survival between the offspring and the parents.
1) Elitism simply takes the best fraction of the current population and always carries it over to the next generation. This guarantees good convergence, but can also lead to poor diversity if the elitism proportion is large. Elitism is usually paired with random selection or any of the other techniques to get a mixture of good diversity and convergence.
2) Hall of Fame is a mechanism where the best individual of the population is set aside in the Hall of Fame, where after the algorithm has stopped, the best solution from this hall of fame is chosen. By doing this, it allows the algorithm to favor good exploration with random selection without the fear of losing the best individual.
Note that all of these methods are probabilistic, so an individual can be selected multiple times for survival and reproduction.
Now that we've selected which individuals will reproduce, we need a way to share genetic material between the parents and introduce new material. In EAs this is performed through mutation and crossover.
1) Crossover is how the genomes of two parents are mixed to create offspring. It works by selecting a crossover point and then swapping the genome segments between the parents to create two new offspring. We can see this in the following example:
2) Mutation, on the other hand, introduces new genetic material by mutating one or more genes in the offspring's genome, for example by randomly flipping bits. In practice, mutation is applied mainly to poor individuals to encourage exploration, while good solutions are mutated little to preserve convergence.
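For bit-string genomes, the two operators can be sketched as follows; this is a hedged sketch of the general idea, not the course's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_point_crossover(p1, p2):
    # Cut both bit strings at a random point and swap the tails.
    point = rng.integers(1, len(p1))
    c1 = np.concatenate([p1[:point], p2[point:]])
    c2 = np.concatenate([p2[:point], p1[point:]])
    return c1, c2

def bit_flip_mutation(genome, p_flip=0.1):
    # Uniform mutation: each bit flips independently with probability p_flip.
    mask = rng.random(len(genome)) < p_flip
    return np.where(mask, 1 - genome, genome)

p1 = np.array([1, 1, 1, 1, 1, 1])
p2 = np.array([0, 0, 0, 0, 0, 0])
c1, c2 = one_point_crossover(p1, p2)
print(c1, c2)  # complementary children sharing one crossover point
```

With all-ones and all-zeros parents, the two children are bitwise complements of each other, which makes the single crossover point easy to see.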
Now that we’ve defined our reproduction operators, we simply repeat the entire process of selection and reproduction until the stopping conditions are met.
The four most common types of stopping criteria:
A common stopping criterion is no change in the best fitness value after a certain number of generations. However, this is problematic when elitism is not used or the best solution is not retained each generation, as the best fitness value will fluctuate. Moreover, even when elitism is used, EAs can stagnate for long stretches of generations before finding a better solution.
One could instead stop when there is no change in the mean fitness value, indicating the population has converged around an area. However, if the selection and reproduction operators encourage exploration over exploitation, the population's mean fitness will fluctuate with such variance that this condition may never be met.
On the other hand, one could stop when an acceptable solution has been found, such as in neural network training when the MSE is below one, or in classification when the cross-entropy loss is near zero. However, this can lead to premature exit from the algorithm when a better solution could still have been found.
Lastly, the criterion most commonly found in practice is simply to run the algorithm for many generations and terminate when the maximum generation count has been reached.
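The criteria above can be combined in a driver loop. This skeleton is my own sketch; the `step` callback standing in for one generation of selection and reproduction is an assumption:

```python
def run_ga(step, max_generations=100, patience=10, target=None):
    # Stop on max generations, on `patience` stagnant generations, or on
    # reaching an acceptable `target` fitness (maximization assumed).
    best, stale, gens = float("-inf"), 0, 0
    for _ in range(max_generations):
        gens += 1
        fitness = step()  # one generation of selection + reproduction
        if fitness > best:
            best, stale = fitness, 0
        else:
            stale += 1
        if target is not None and best >= target:  # acceptable solution found
            break
        if stale >= patience:                      # stagnation
            break
    return best, gens

# Toy stand-in: best fitness improves for 5 generations, then stagnates.
values = iter([1, 2, 3, 4, 5] + [5] * 50)
best, gens = run_ga(lambda: next(values))
print(best, gens)  # 5 15 -- stopped after 10 stagnant generations
```

Note the stagnation counter only behaves sensibly when the best fitness is monotone, i.e. when elitism or a hall of fame is in place, echoing the caveat above.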
There are many different fields of evolutionary computation.
The purpose of the initial population, along with the choice of population size, is to ensure good initial diversity. The fitness function in most cases is the optimization problem itself. Next, we need to decide how to encode the chromosome and what the phenotype will be. Then we choose our selection and reproduction operators, remembering the trade-off between exploitation and exploration. Finally, we choose when to stop the algorithm.
First, we will cover the standard canonical genetic algorithm. Then we will cover floating point representation, and move straight into crossover and mutation operators. Then we will cover genetic algorithm variants, advanced topics, and how to handle different optimization problems. Throughout the post we will have a few examples given in Python code. Because this unit will be quite extensive, it will be broken down into parts.
The canonical genetic algorithm is regarded as the simplest and one of the earliest genetic algorithms used in practice. It utilizes a binary/bit-string representation of the genome for encoding and decoding, proportional selection via roulette wheel, one-point crossover, and uniform mutation of the genome.
In this algorithm, the survivors are the offspring of the parents selected for reproduction. In addition, instead of always creating offspring through mutation or crossover, each offspring has a probability of being crossed over or mutated; if neither is applied, it defaults to a copy of the parent with the better fitness value. The purpose of these crossover and mutation probabilities is to prevent loss of the parents' gene information, since the offspring might be worse than the parents.
1) The canonical genetic algorithm utilizes binary representation, as it is more directly inspired by biology, where evolution occurs in the DNA by altering chemical compounds, akin to flipping bits in mutation or crossing over the genome. However, binary representation is no longer widely used, due to the cost of encoding and decoding floating-point numbers to and from binary, especially if Gray encoding is used to get around the Hamming-cliff problem discussed in the previous unit.
2) Floating-point representation leaves the numbers as they are, saving the time of encoding and decoding; however, it is less intuitive than binary representation: how does one mutate or cross over floating-point numbers? We will address this problem later on.
Lastly, one might question the accuracy of the two representations: does floating-point representation find solutions as well as binary? It has been shown that floating-point representation can work as well as, and sometimes better than, standard binary encoding, so there is no reason to worry about algorithm quality if floating-point representation is used.
Crossover operators can be classified into three types, asexual, sexual and multi-recombination.
In addition, for the parents selected to create offspring, crossover is applied with some probability, so that the parents' genetic information is not always lost to crossing over; higher crossover probabilities equate to higher chances of crossing over. If crossover is not applied, the offspring is a copy of the fittest parent. Also, if only one offspring is generated, it can replace the worst parent either directly or probabilistically, for example through Boltzmann selection.
There are three main types of crossover techniques for binary encoding, one point, two point, and uniform. We can see examples of these above.
There are two main types of floating point crossover techniques.
For our examples, we will use two common crossover techniques.
$$\tilde{x}_{ij}=(1-\gamma)x_{1j}+\gamma x_{2j},\quad\gamma\in[0,1]$$
Note that $\gamma=0.5$ gives the mean of the two parents; in practice take $\gamma\sim N(0.5,0.15)$ or $\gamma\sim U(0,1)$. Here $x_1$ and $x_2$ denote the parents, $\tilde{x}_i$ denotes the $i$'th child, and $j$ denotes the $j$'th variable.
Note that this can be extended to multiple parents, where each parent has its own gamma value and the gamma values sum to one:
$$\tilde{x}_{ij}=\sum_p\gamma_px_{pj},\quad\sum_p\gamma_p=1$$
Intuitive Crossover (inherit from a parent probabilistically):
- Two parents: $\tilde{x}_{ij}=x_{1j}$ or $\tilde{x}_{ij}=x_{2j}$
- Multiple parents: $\tilde{x}_{ij}=x_{pj}$
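Both techniques can be sketched in a few lines; the function names are mine, and the arithmetic version implements the linear-combination formula above:

```python
import numpy as np

rng = np.random.default_rng(4)

def arithmetic_crossover(x1, x2):
    # Linear-combination crossover: the child is a random convex
    # combination of the two parents.
    gamma = rng.uniform(0.0, 1.0)
    return (1 - gamma) * x1 + gamma * x2

def intuitive_crossover(x1, x2):
    # Each variable is inherited from one parent, chosen at random.
    mask = rng.random(len(x1)) < 0.5
    return np.where(mask, x1, x2)

x1 = np.array([0.0, 0.0, 0.0])
x2 = np.array([2.0, 2.0, 2.0])
child = arithmetic_crossover(x1, x2)
print(child)  # every component lies between the two parents
```

Note the arithmetic child always lies on the segment between the parents, while the intuitive child only recombines existing coordinate values; neither introduces new genetic material on its own.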
Above we have a visual representation of what happens when we use the first crossover technique. This was performed by randomly selecting two individuals to mate and producing an offspring through a linear combination, where $\gamma$ was drawn uniformly at random from 0 to 1.
We can definitely see that
Now we have above a visual representation of what happens when we use the second crossover technique. This was performed by randomly selecting two individuals to mate and swapping the x and y values of the two parents to create two offspring.
We can see in the pictures on the right that by generation 300 the algorithm leaves some major holes that it does not explore. This is due to the fact that this crossover technique does not introduce any new genetic material.
In our example, because no initial value had an x value of -50, none of the offspring by generation 300 explored near that region. This is why it is vital to pair crossover techniques with mutation, which we will cover next.
Now that we’ve performed crossover on our offspring, we need a way to introduce new genetic material, this is done through mutation.
For binary representation, there are two main ways to perform mutation: (1) uniform mutation, where each bit has an equal chance of being flipped, or (2) randomly flipping exactly one bit.
For floating point representation there are many different mutation techniques; however, I will first introduce a basic mutation method where we add a random uniform value to each component of the offspring from the uniform domain of the bounds for each given variable: $$x_j=x_j+U(-x_j^{bound},x_j^{bound})$$
Notice that if this bound is large it would encourage exploration while if it is small it will encourage exploitation.
The choice for the max mutation value is up to the designer, but it is usual to keep it at around 1% of the total domain space for a given variable.
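A sketch of this uniform mutation with the 1% rule; the clipping back into bounds is my own addition to keep mutants feasible:

```python
import numpy as np

rng = np.random.default_rng(5)

def uniform_mutation(x, lower, upper, frac=0.01):
    # Add a uniform perturbation within frac (here 1%) of the domain width
    # to each variable, then clip back into the bounds.
    bound = frac * (upper - lower)
    mutated = x + rng.uniform(-bound, bound, size=len(x))
    return np.clip(mutated, lower, upper)

x = np.array([0.0, 1.0, -2.0])
m = uniform_mutation(x, lower=-3.0, upper=3.0)
print(m)  # each component moved by at most 0.06
```

Raising `frac` widens the perturbation and shifts the operator toward exploration, exactly the trade-off described above.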
Above we have a visual representation of what happens when we use the mutation technique in floating point representation. This was performed by adding a small random value to each of the x and y direction for each individual.
We can see already that mutation is simply exploring and creating a small cluster around the point. This shows the power of mutation by introducing new genetic material.
Lastly, we have above a visual representation of the second crossover technique paired with mutation. This was performed by randomly selecting two individuals to mate, swapping the x and y values of the two parents to create two offspring, and then adding a small random uniform value in both the x and y directions.
If you remember, values near -50 were not explored due to the initial generation lacking any points with that value; however, because we added mutation it introduced new genetic material that allowed crossover to explore that area. This shows the power of crossover paired with mutation as a viable search tool for genetic algorithms.
Let's go over a trivial example. Suppose we have the following function we want to maximize: $f(x)=-x^2+3$, whose maximum occurs at $x=0$ with $f(0)=3$.
import numpy as np
def fitness_function(x):
    return -(x**2) + 3  # our fitness function is the objective itself
size = 5 # initial population size
# domain
lower_bound = -3
upper_bound = 3
# initial generation -- needs to be random across entire domain for
# accurate representation
init_gen = np.random.uniform(lower_bound, upper_bound, size)
init_gen
array([-0.65641603, -2.69513081, -2.43704411, -0.17485507, -0.53588508])
fitness = fitness_function(init_gen)
fitness # fitness values of initial generation
array([ 2.569118 , -4.26373009, -2.93918398, 2.96942571, 2.71282719])
# because we are using roulette wheel selection, we cannot work with
# negative fitness values; therefore, we scale so that all fitness values
# are non-negative by adding the absolute value of the minimum to each
# value; here is our new scaled fitness
scaled_fit = fitness + np.abs(np.min(fitness))
scaled_fit
array([6.83284809, 0. , 1.32454611, 7.23315579, 6.97655727])
# create proportion of selection by dividing each fit by sum
distribution = scaled_fit / np.sum(scaled_fit)
distribution
array([0.30548645, 0. , 0.05921848, 0.32338361, 0.31191147])
# instead of sorting the actual individuals, we can sort the indices
# and use those to retrieve the individuals
ind = range(0, size) # indices
temp = np.column_stack((distribution, ind))
temp = temp[np.argsort(temp[:,0]),] # sort fitness values
temp # notice that now our indices are sorted along with fitness
array([[0. , 1. ],
[0.05921848, 2. ],
[0.30548645, 0. ],
[0.31191147, 4. ],
[0.32338361, 3. ]])
distribution = temp[:, 0] # retrieve sorted fitness values
distribution
array([0. , 0.05921848, 0.30548645, 0.31191147, 0.32338361])
# now that our fitness values are sorted, we can take the best 20%
elitism = 0.2
best_index = size-int(size*elitism)
# index in our indices where the best 20% start
print("Best Index {}".format(best_index))
best_ind = range(best_index, size)
best_ind = temp[best_ind, 1]
best_ind
Best Index 4
array([3.])
# create a cumulative distribution from our scaled proportional fitness values
cumulative_sum = np.cumsum(distribution)
cumulative_sum # this creates our roulette wheel
array([0. , 0.05921848, 0.36470493, 0.67661639, 1. ])
# Now it is time for selection, so we randomly create 5 values
# and see where they lie on the wheel, i.e. between the
# cumulative sums above
r = np.random.uniform(0, 1, size)
r
array([0.74107001, 0.8177645 , 0.82451094, 0.65121664, 0.25922782])
sel_ind = []
for p in r:  # for every random probability value in r
    index = 0
    # advance until the cumulative sum reaches the probability value
    # ex: say our cumulative sum is [0, 0.25, 0.6, 0.9, 1]
    # and our random value is 0.43, which lies between 0.25 and 0.6;
    # therefore, index 2 is chosen
    while p > cumulative_sum[index]:
        index += 1
    sel_ind.append(index)
sel_ind # the indices of the selected individuals
[4, 4, 4, 3, 2]
mat_ind = np.random.choice(sel_ind, size, replace=False)
mat_ind # the mates of the first set of parents
array([4, 4, 2, 3, 4])
# now that we have the indices of our parents selected, we need
# to get their actual values
# remember 'temp' holds the original indices of our population
ind = np.asarray(temp[sel_ind, 1], dtype=int).tolist()
selected = init_gen[ind] # get selected individuals
f1 = fitness[ind]
# do the same for the second set of parents
ind = np.asarray(temp[mat_ind, 1], dtype=int).tolist()
mates = init_gen[ind]
f2 = fitness[ind]
print(selected) # selected individuals
print(f1) # their fitness values
print(mates) # second set of parents
print(f2) # their fitness values
[-0.17485507 -0.17485507 -0.17485507 -0.53588508 -0.65641603]
[2.96942571 2.96942571 2.96942571 2.71282719 2.569118 ]
[-0.17485507 -0.17485507 -0.65641603 -0.53588508 -0.17485507]
[2.96942571 2.96942571 2.569118 2.71282719 2.96942571]
# now it is time to create the offspring
# we will create 5 offspring to replace the five sets of parents
p_cross = 0.75 # we crossover probabilistically
offspring = []
r = np.random.uniform(0, 1, 5) # random values to determine to crossover
for i in range(0, size):
if r[i] < p_cross:
# crossover technique is average, where gamma is some random
# value between 0 and 1
gamma = np.random.uniform(0, 1, 1)[0]
child = (1-gamma)*selected[i]+gamma*mates[i]
else:
# if we don't crossover, choose fittest parent as offspring
if f1[i] < f2[i]:
child = mates[i]
else:
child = selected[i]
offspring.append(child)
offspring
[-0.17485506659292493, -0.17485506659292493, -0.17485506659292493, -0.5358850762105027, -0.17485506659292493]
# now that we've crossed over the parents, we need to mutate to introduce
# new genetic material
p_mut = 0.75
# how much should we mutate by? It is usually around 1% of the
# entire domain space
bound_mut = 0.01 * (np.abs(lower_bound)+np.abs(upper_bound))
print("BOUND: {}".format(bound_mut))
r = np.random.uniform(0, 1, 5)
for i in range(0, size):
# we either mutate or don't
if r[i] < p_mut:
offspring[i] += np.random.uniform(-bound_mut, bound_mut, 1)[0]
offspring = np.asarray(offspring)
offspring
BOUND: 0.06
array([-0.18885104, -0.21426016, -0.22801573, -0.58815671, -0.12942555])
# now that we've created our new offspring, lets check out how well it
# does!
new_fit = fitness_function(offspring)
ind = range(0, size)
# combine the new fitness with the new indices
temp = np.column_stack((new_fit, ind))
temp = temp[np.argsort(-temp[:,0]),] # sort from largest to smallest
temp
array([[2.98324903, 4. ],
[2.96433528, 0. ],
[2.95409258, 1. ],
[2.94800883, 2. ],
[2.65407168, 3. ]])
# now it is time for elitism, we need to replace the worst 20% with the
# previous generation's best 20%
ind_replace = np.array(temp[best_index:size, 1], dtype=int).tolist()
ind_replace
# this is the set of indices from the previous gen to replace
[3]
next_gen = np.copy(offspring)
next_gen[ind_replace] = init_gen[np.asarray(best_ind, dtype=int).tolist()]
next_gen # here we have our new generation
array([-0.18885104, -0.21426016, -0.22801573, -0.17485507, -0.12942555])
# here we can see the fitness values of our next generation
new_fit = fitness_function(next_gen)
print("Mean : {}".format(np.mean(new_fit)))
print("Best: {}".format(np.max(new_fit)))
Mean : 2.963822285465617
Best: 2.9832490266900367
# here is our old generation
print("Mean : {}".format(np.mean(fitness)))
print("Best: {}".format(np.max(fitness)))
Mean : 0.20969136487426815
Best: 2.969425705686784
# Awesome! This completes one iteration of the algorithm! We simply need
# to repeat this over and over again until it converges!
# In the next block I combined all these components into a function!
# init_gen: initial generation
# p_cross: probability of performing crossover
# p_mutate: probability of mutating the offspring
# bounds: the lower and upper domain bounds of the variable
# mutate_bound: percentage of the variable's domain to mutate by
# fitness_function: the function used to evaluate each individual
# elitism: percentage of the best of each generation to carry into the next
# max_iter: maximum number of generations to run
def evolve(init_gen, p_cross, p_mutate, bounds, mutate_bound, fitness_function, elitism=0.0, max_iter=100):
gen = np.copy(init_gen)
size = len(gen)
# find max value to mutate by, using the bounds parameter
bound_mut = mutate_bound * (np.abs(bounds[0]) + np.abs(bounds[1]))
for j in range(0, max_iter):
fitness = fitness_function(gen)
print("Generation {}".format(j + 1))
print(" Mean : {}".format(np.mean(fitness)))
print(" Best: {}".format(np.max(fitness)))
# scale fitness so that all are positive
scaled_fit = fitness + np.abs(np.min(fitness))
# create proportions by dividing each fitness by the sum
distribution = scaled_fit / np.sum(scaled_fit)
# sort based on indices rather than sorting the individuals
ind = range(0, size)
temp = np.column_stack((distribution, ind))
temp = temp[np.argsort(temp[:, 0]),]
distribution = temp[:, 0]
# find the starting point of the best elitism% to carry over
best_index = size - np.maximum(1, int(size * elitism))
best_ind = range(best_index, size)
best_ind = temp[best_ind, 1]
# create cumulative sum for selection
cumulative_sum = np.cumsum(distribution)
# perform selection
r = np.random.uniform(0, 1, size)
sel_ind = []
for p in r:
index = 0
while p > cumulative_sum[index]:
index += 1
sel_ind.append(index)
# create second set of parents
mat_ind = np.random.choice(sel_ind, size, replace=False)
# now that we have the indices, let's get the individuals at
# those indices
ind = np.asarray(temp[sel_ind, 1], dtype=int).tolist()
selected = gen[ind]
f1 = fitness[ind]
ind = np.asarray(temp[mat_ind, 1], dtype=int).tolist()
mates = gen[ind]
f2 = fitness[ind]
# create offspring first through crossover, probabilistically
offspring = []
r = np.random.uniform(0, 1, size)
for i in range(0, size):
if r[i] < p_cross:
gamma = np.random.uniform(0, 1, 1)[0]
child = (1 - gamma) * selected[i] + gamma * mates[i]
else:
if f1[i] < f2[i]:
child = mates[i]
else:
child = selected[i]
offspring.append(child)
# now mutate that offspring probabilistically
r = np.random.uniform(0, 1, size)
for i in range(0, size):
if r[i] < p_mutate:
offspring[i] += np.random.uniform(-bound_mut, bound_mut, 1)[0]
offspring = np.asarray(offspring)
# replace worst with best from previous generation
new_fit = fitness_function(offspring)
ind = range(0, size)
temp = np.column_stack((new_fit, ind))
temp = temp[np.argsort(-temp[:, 0]),]
ind_replace = np.array(temp[best_index:size, 1], dtype=int).tolist()
# set up next generation for next iteration
next_gen = np.copy(offspring)
next_gen[ind_replace] = gen[np.asarray(best_ind, dtype=int).tolist()]
# set up for next iteration
gen = np.copy(next_gen)
return gen
size = 5
lower_bound = -3
upper_bound = 3
init_gen = np.random.uniform(lower_bound, upper_bound, size)
# domain bounds
bounds = [upper_bound, lower_bound]
last_gen = evolve(init_gen, 0.75, 0.75, bounds, 0.01, fitness_function, 0.1, 10)
Generation 1
Mean : -0.09260721851766425
Best: 2.696175885446254
Generation 2
Mean : 2.0823134310559333
Best: 2.703586692002732
Generation 3
Mean : 2.6574619511379205
Best: 2.941096322628953
Generation 4
Mean : 2.797779726227234
Best: 2.96975052460602
Generation 5
Mean : 2.9465912869982125
Best: 2.9982005864572976
Generation 6
Mean : 2.9940240859606355
Best: 2.999168543856943
Generation 7
Mean : 2.9966430467039205
Best: 2.999168543856943
Generation 8
Mean : 2.9976802962948588
Best: 2.9992814208746017
Generation 9
Mean : 2.9983627862324953
Best: 2.9998816370891994
Generation 10
Mean : 2.9991283348893623
Best: 2.999977200451066
last_gen
array([ 0.00477489, 0.00477489, 0.00477489, -0.03295003, 0.00477489])
There are three types of hyperparameters that influence the quality of the algorithm:
Static hyperparameters are just what they sound like: they stay constant throughout the entire run of the algorithm. Large static values encourage exploration (except for elitism), while small static values encourage exploitation (except for elitism).
Dynamic hyperparameters decrease from large values to small values over the course of the algorithm’s execution. This allows exploration during the early stages of the algorithm and exploitation later on. The problem with dynamic hyperparameters is that the parameters will decrease no matter what as each generation passes, even if they are performing well, while the problem with static hyperparameters is that they never change.
Self-Adaptive hyperparameters change only depending upon how successful they are: if the max mutation bound value is too low it will self-adapt and increase, or decrease if it is too high. These parameters are an integral part of Unit 5.
For Dynamic hyperparameters, the most common way to handle this dynamic decrease is either through linear or non-linear functions.
Where $p_{max}$ is the max value for the parameter, $t$ is the generation, and $r$ is the rate of decay. Smaller values for $r$ encourages exploitation, while larger values for $r$ encourage exploration.
Because Logistic Decay decays asymptotically, the parameter value approaches zero as the generation count approaches infinity, which leads to stagnation and premature convergence. We do not want this to happen in practice, so we implement a max operation: if the new parameter value falls below some minimum, it is set to that minimum.
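The decay schedules can be sketched as follows. This is a minimal sketch, not the post's exact implementation; the function names and the p_min floor parameter (which implements the max operation just described) are my own.

```python
import numpy as np

# linear decay: falls from p_max toward 0 over max_iter generations,
# clamped at p_min to avoid stagnation
def linear_decay(p_max, p_min, t, max_iter):
    return max(p_min, p_max * (1 - t / max_iter))

# geometric decay: p_max * r**t with 0 < r < 1; values of r near 1
# decay slowly (exploration), small r decays quickly (exploitation)
def geometric_decay(p_max, p_min, t, r):
    return max(p_min, p_max * r ** t)
```

Either schedule can be called once per generation to recompute, for example, the mutation bound.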
Generation Gap Methods are variants that implement different survival selection methods for determining who survives into the next generation. There are three popular methods for this approach:
In Parent-Offspring battles, the pair of parents face off against their set of offspring, where the offspring are accepted probabilistically based upon their fitness value. The most common way to handle this is through Boltzmann Selection, which we will cover in the next post.
Like in the first method, we always create offspring, removing the need for crossover and mutation probabilities, and pool them together with the parents. After pooling the parents and offspring together, you use tournament selection for survival, except the worst-fit individual is killed off. You repeat this until the pool shrinks back down to its original size. The benefit of this procedure is that the battle is not directly between parent-offspring pairs but within the total pooled population, as the offspring of one set of parents could be worse than their own parents yet better than another set of parents.
Elitism strategies are like the second method, pooling all the parents and offspring together and then sorting them based on fitness or age. If sorting based on fitness, this method resembles elitism in that the best half of the sorted individuals are chosen to survive. Besides sorting on fitness, one could sort by age and fitness, where younger individuals are preferred over older ones. Age denotes how many generations an individual has survived. The benefit of this procedure is that the algorithm does not hold onto old solutions, which could cause stagnation; the drawback is that the algorithm will reject good old solutions in favor of new ones even if they are worse.
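The fitness-sorted variant can be sketched as below (the helper name is my own; fitness is maximized, as in this post's running example):

```python
import numpy as np

# pool parents and offspring, sort by fitness (largest first),
# and keep the best half to preserve the population size
def pooled_survival(parents, offspring, fitness_function):
    pool = np.concatenate((parents, offspring))
    fit = fitness_function(pool)
    order = np.argsort(-fit)            # fittest first
    return pool[order[:len(parents)]]
```

The age-based variant would only change the sort key, e.g. sorting on (age, -fitness) instead of fitness alone.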
The population size of the algorithm dynamically grows to model population sizes in nature.
There are two common ways to do this.
Running in Parallel variants are Genetic Algorithms designed to run in parallel.
In industry, running Genetic Algorithms on real-world problems can be extremely time-consuming and power-hungry. We can reduce this cost by running our algorithm in parallel. This can be done in three ways: evaluate the fitness function in parallel, assign a processor per individual, or assign sub-populations per processor.
Island Algorithms, as we can see above, are extremely close to Running in Parallel variants in that they are usually performed on multiple processors where each processor represents an island; however, these algorithms can be implemented on a single processor.
In Island Algorithms we have what are known as migration policies, which contain initialization, communication topology, migration rate, migration selection, and replacement selection.
Migration and replacement selection can be performed using any of the selection strategies discussed thus far in the series. The purpose of migration is to introduce new genetic material to an island.
One of the main problems with crossover is that it may lead to poor individuals. Take for example the following function below:
Problem: We can see here from the topology that there are four equal global minimum values at each corner of the domain. If our algorithm has two individuals with similar fitness scores but at different locations, and we were to crossover using the averaging strategy our offspring would end up in the middle at a poor location.
This is where the parent-offspring battle would be a useful mechanism, as the parent would not be replaced due to the inadequacy of the offspring. On the other hand, if we did not have this component, we could implement selective mating through clustering.
Selective mating means that individuals are only allowed to mate with individuals within the same cluster, thus preventing two individuals with a far Euclidean distance from mating.
The only problem with this line of thinking is that it can lead to poor exploration, since individuals only mate and reproduce within the same cluster, reducing the global sharing of genetic material.
Selective mating can easily be implemented with the island algorithm as only individuals within the same island are allowed to mate and reproduce. The only problem is determining the number of islands and who goes where. A solution to the number of islands is to use some type of criterion, like Silhouette score in K-means clustering.
As choosing the cluster count for K-means clustering is complicated, we can pick the best number of clusters with the silhouette method. This method starts with two clusters and evaluates increasing cluster counts, computing a silhouette score for each, which measures how well each point fits its own cluster compared to the other clusters.
The silhouette score ranges between -1 and 1, where 1 means the clusters are dense and well separated, and -1 means points have largely been assigned to the wrong cluster. We keep increasing the cluster count up to our maximum choice and choose the cluster count with the best score.
This can be used on the initial population to divide it into clusters to be partitioned in the island algorithm.
Note that choosing a uniform initial population will not work with K-means clustering; therefore, when uniform initial populations are used we can instead divide the domain space into squares and give each of those subdomains to an island, and then use K-means clustering for choosing who reproduces with whom.
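The silhouette criterion for picking the island count can be sketched with scikit-learn (the helper name choose_island_count is my own):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# try k = 2..max_k and return the cluster count with the best silhouette score
def choose_island_count(population, max_k=10):
    X = np.asarray(population).reshape(len(population), -1)
    best_k, best_score = 2, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

The returned labels from the winning KMeans fit would then assign each individual to its island.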
Recalling Unit 1, there are four types of optimization problems:
Unconstrained problems are easy; we’ve been solving them this entire time.
However, one thing we have not covered is problems where the domain is discrete. The most popular way to handle this is to change nothing in the algorithm itself and simply round each variable to the nearest integer when it comes time to evaluate the fitness function.
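This rounding trick can be sketched in a few lines (the helper name is mine): the population keeps evolving as continuous values and is rounded only at evaluation time.

```python
import numpy as np

# round each variable to the nearest integer, then evaluate the fitness
# on the rounded values; the continuous individuals are left untouched
def discrete_fitness(individuals, fitness_function):
    return fitness_function(np.rint(individuals))
```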
For constrained problems, the common solution is simply to add penalty terms for solutions that are infeasible, or change the reproduction and initialization operators such that no infeasible solution is created.
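As a sketch of the penalty approach, suppose (my example constraint) we require x >= 0: infeasible individuals get a penalty proportional to their violation subtracted from their (maximized) fitness, so selection steers the search back into the feasible region.

```python
import numpy as np

# subtract a penalty proportional to the constraint violation (here x >= 0)
# so infeasible individuals are unlikely to be selected
def penalized_fitness(individuals, fitness_function, penalty=100.0):
    violation = np.maximum(0.0, -np.asarray(individuals))
    return fitness_function(individuals) - penalty * violation
```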
For multi-objective problems, one can create an aggregate function by simply adding all objective functions together and minimizing that total; or one could use Pareto dominance, where the parents and offspring are pooled together, a tournament-style selection is used, and the most dominant individual is chosen to survive.
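The Pareto-dominance test used in such a tournament can be sketched as follows (for minimization; the helper name is mine): individual a dominates b if it is no worse on every objective and strictly better on at least one.

```python
import numpy as np

# a dominates b (minimization): no objective worse, at least one strictly better
def dominates(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))
```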
Multi solution problems can be solved by using a form of an island algorithm. In multi solution problems the global min or max is usually known but the goal is to find every location where this occurs.
Therefore, in our island algorithm, whenever a new individual is spawned whose fitness value is close or equal to the global min or max but whose Euclidean distance from its island’s mean exceeds some threshold, it migrates to the island whose mean is closest.
However, if no island exists whose mean is within that threshold of the individual, it founds its own island and reproduces asexually until new offspring or new migrants are present.
In this way, island algorithms can solve multi solution problems by having each island be its own cluster with the same global min or max solution. After running for the generation count, the number of islands represents the number of different positions found.
The much anticipated application of evolving the weights of a Neural Network for Time Series Analysis!
Time series analysis refers to any series of data where the data points are ordered sequentially in time.
Time series analysis is extremely applicable in many fields, from economics to weather projection and forecasting.
The problem we will be analyzing is a sunspot time series dataset. Sunspot cycles are cycles of solar magnetic activity on the Sun, and the monthly mean number of sunspots has been recorded since 1749. Our goal is to see how well we can model this time series and predict future monthly mean numbers. Here we have a view of what the time series dataset looks like over its entire lifespan:
Data is from Kaggle.
We will be attempting to solve this problem through a Feed Forward Neural Network.
Artificial neural networks were inspired by biological neural networks, the basic units of the brain. These models can perform classification, regression, and other techniques. Neural networks have become extremely common in Machine Learning and Artificial Intelligence due to their success across many problem types.
The architecture of a neural network can be split into three components:
The weights between consecutive layers are represented by a matrix.
We represent each connection between two neurons by a weight, which can be combined to form a matrix. A forward pass through our network is simply multiplying the matrices together, applying our transition function at each layer, and then getting our result.
The bias is a [1, M] vector added to each observation before applying the activation function, where M is the number of nodes in that layer.
Neural networks are commonly trained through a form of backpropagation; where the current error, either MSE or Cross-Entropy, is propagated back from the output layer throughout the hidden layers to adjust the weights such that they minimize the error. However, in our application, instead of using backpropagation we will use a genetic algorithm to train the weights.
The most common implementation for solving time series problems in neural networks is by using a recurrent neural network.
A recurrent neural network is a network where there is a recurrent layer from the output that is fed back into the input layer for the next value at the subsequent time index.
However, an easy way to get around this implementation is to simply make the input layer a ‘window’ that spans some value of indices before the current value.
The goal is to predict the values in the black boxes using the values in the gray boxes before them; the gray boxes make up the prediction window. The window size shown here is 3, but in practice the window size has to be tested.
Because we will be evolving the weights of a neural network, we need to implement one from scratch as it would require a lot of maneuvering to do so in common Python implementations. We will implement our network to accommodate an arbitrary number of layers and nodes per layer, but will only have the ReLU activation function for each layer.
# ReLU Activation Function
def relu(x):
return np.maximum(x, 0)
# A Neural Network used For Evolution
# Our algorithm will create a list of these objects
class EvolvableNetwork:
# Layer Nodes is a list of int values denoting the number of nodes per layer
# For example, if layer_nodes = [3, 5, 3], we have three hidden layers with a 3-5-3 node architecture
# num_input and num_output refer to the number of input and output variables
# I will explain the purpose of the Initialize boolean later, but simply if False, do not create the weight
# and bias matrices.
# All weight and bias matrices are initialized uniformly random between -1 and 1
def __init__(self, layer_nodes, num_input, num_output, initialize=True):
self.layer_count = len(layer_nodes)
self.layer_nodes = layer_nodes
self.num_input = num_input
self.num_output = num_output
self.activation_function = relu # contains a function pointer
self.layer_weights = []
self.layer_biases = []
if not initialize: # I will discuss the purpose of this later
return
# create the NxM weight and bias matrices for input Layer
self.layer_weights.append(
np.random.uniform(-1, 1, num_input * layer_nodes[0]).reshape(num_input, layer_nodes[0]))
self.layer_biases.append(np.random.uniform(-1, 1, layer_nodes[0]))
# create the weight matrices for Hidden Layers
for i in range(1, self.layer_count):
self.layer_weights.append(
np.random.uniform(-1, 1, layer_nodes[i-1]*layer_nodes[i]).reshape(layer_nodes[i-1], layer_nodes[i]))
self.layer_biases.append(np.random.uniform(-1, 1, layer_nodes[i]).reshape(1, layer_nodes[i]))
# Create the weight and bias matrices for output Layer
self.layer_weights.append(
np.random.uniform(-1, 1, layer_nodes[self.layer_count-1]*num_output).reshape(layer_nodes[self.layer_count-1],
num_output))
self.layer_biases.append(np.random.uniform(-1, 1, num_output).reshape(1, num_output))
# same as forward pass, performs matrix multiplication of the weights
def predict(self, x):
output = self.activation_function(np.dot(x, self.layer_weights[0]) + self.layer_biases[0])
for i in range(1, self.layer_count + 1):
if i == self.layer_count: # last layer so don't use activation function
output = (np.dot(output, self.layer_weights[i]) + self.layer_biases[i])
else:
output = self.activation_function(
np.dot(output, self.layer_weights[i]) + self.layer_biases[i])
if self.num_output == 1: # if there is only one output variable then reshape
return output.reshape(len(x), )
return output
The most common pitfall for all machine learning models is overfitting. This occurs when our model has so many tunable parameters that it starts to memorize the input. As a result, it performs worse when predicting values it has never seen than on the values it was trained upon.
The goal of all of Machine Learning and Data Science is to create simple but yet powerful models that can adapt and can interpolate new values accurately.
There are many different ways to prevent overfitting in neural networks, from dropout layers to cross-validation; however, the most common is simply early stopping.
We use our training dataset to train our model and then test its performance using the testing dataset. In this way, we train on our training set and once we start to see overfitting, meaning the error in our testing set starts to increase, we stop our algorithm. This is known as early stopping.
In this case we have already used the test set for early stopping, so there is nothing left to test on after tuning hyperparameters. To get around this, we introduce a validation dataset, which replaces the testing dataset in early stopping.
We now split our dataset into three parts,
By doing this, we ensure that our model has never seen the values in the testing dataset in order to get an accurate evaluation of our model.
Now it’s time to discuss the exact implementation details of our evolutionary algorithm.
A. As discussed in the previous posts, each individual is made up of its genotype and phenotype.
In our problem, the genotype is made up of weight and bias matrices, where each matrix is a gene in the genome. For the phenotype, the weight and bias matrices compiled together form a neural network in the environment.
B. Now that we know how we will encode our chromosome, it is time to discuss the reproduction operators. For selection, we will use roulette wheel selection, which works by creating a cumulative distribution from the proportion of an individual being chosen based off its fitness value:
def roulette_wheel_selection(cumulative_sum, n):
ind = []
r = np.random.uniform(0, 1, n)
for i in range(0, n):
index = 0
while cumulative_sum[index] < r[i]:
index += 1
ind.append(index)
return ind
C. For crossover, we will implement the averaging technique, which takes a linear combination of the parent values. For our problem, the offspring weight and bias matrices will just be a linear combination of the parent’s matrices. To do this, we first instantiate a new EvolvableNetwork, except this time with initialize equal to False, because we do not want the weight and bias matrices to be initialized to random values as we will create them from the parents:
# p1 and p2 are parents
# const_cross is the coefficient for the linear combination, ranges between [0, 1]
# if near 0, it will favor p1; if near 1, it will favor p2; if equal to 0.5, it is the mean between the parents
def crossover(p1, p2, const_cross):
# initialize new network with empty layer weights and biases
child = EvolvableNetwork(layer_nodes=p1.layer_nodes, num_input=p1.num_input, num_output=p1.num_output,
initialize=False)
# fill child weight and bias matrices from the parents
for i in range(0, p1.layer_count+1):
child.layer_weights.append( (1-const_cross)*p1.layer_weights[i]+const_cross*p2.layer_weights[i])
child.layer_biases.append( (1-const_cross)*p1.layer_biases[i]+const_cross*p2.layer_biases[i])
return child
D. For mutation, we will simply add some small random value to each entry of the weight and bias matrix for all matrices:
# const_mutate is the max value to mutate by
def mutation(child, const_mutate):
# loop over all layers
for i in range(0, child.layer_count+1):
n, c = child.layer_weights[i].shape
# these are the random weights to add the current child
r_w = np.random.uniform(-const_mutate, const_mutate, n*c)
# loop over all rows and columns for the current layer
for nr in range(0, n):
for nc in range(0, c):
child.layer_weights[i][nr, nc] += r_w[nr*c+nc]
# loop over all layers
for i in range(0, child.layer_count+1):
# the last dimension holds the node count (the first bias is 1-D, the rest are [1, M])
c = child.layer_biases[i].shape[-1]
# these are the random values to add to the current child's biases
r_w = np.random.uniform(-const_mutate, const_mutate, c)
# loop over all entries of the bias vector
for nc in range(0, c):
child.layer_biases[i][..., nc] += r_w[nc]
Unlike in our previous Genetic Algorithm implementations, this particular implementation will not have hyperparameters for probabilities of mutation, crossover, or elitism.
E. Instead, we can reduce the complexity of tuning our algorithm by getting rid of these parameters. To do this, our set of parents will create a set of four children, where the children will be pooled along with their parents and the individual with the best fitness will be chosen to survive. For the offspring, all four will be created through crossover by different coefficient values, and then the last two cross-over offspring will be mutated by different random values.
Here is what our reproduce function will look like for two parents:
# p1 and p2 are parents
# const_mutate is the max value to mutate by
# fitness_function and data are used to evaluate the pooled family
def reproduce(p1, p2, const_mutate, fitness_function, data):
# create a different gamma coefficient for averaging
# crossover for each offspring
c_cross = np.random.normal(0.5, 0.15, 4)
ch1 = crossover(p1, p2, c_cross[0])
ch2 = crossover(p1, p2, c_cross[1])
ch3 = crossover(p1, p2, c_cross[2])
ch4 = crossover(p1, p2, c_cross[3])
# mutate only two of the individuals
mutation(ch3, const_mutate)
mutation(ch4, const_mutate)
# pool offspring with parents
family = [p1, p2, ch1, ch2, ch3, ch4]
fit = fitness_function(family, data)
# return the individual with the min fitness value (smallest MSE)
return family[np.argmin(fit)]
F. Now it is time for our fitness function; it will compute the Mean Squared Error between the predicted and the actual values for our time series problem: $$ MSE = \frac{\sum_{i=1}^{n} (y_i-\hat{y}_i)^2}{n}$$
from sklearn.metrics import mean_squared_error
# models represents the list of networks
# data is composed of the time series input and output
def fitness_function(models, data):
mse_values = []
x = data[0]
y = data[1]
for network in models:
predictions = network.predict(x)
mse = mean_squared_error(y, predictions)
mse_values.append(mse)
return np.asarray(mse_values)
Because we want to minimize the MSE error function, we need to scale our fitness values such that smaller values yield larger scaled values and larger values yield smaller ones, and then maximize that scaled fitness. We can do this through the following function:
def scale_fitness(x):
return 1 / (1 + x)
Relationship between original fitness values and scaled fitness values:
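A few sample values illustrate the relationship (the function is repeated here so the snippet stands alone):

```python
import numpy as np

def scale_fitness(x):
    return 1 / (1 + x)

# small MSE -> scaled fitness near 1; large MSE -> scaled fitness near 0
mse = np.array([0.0, 1.0, 9.0, 99.0])
print(scale_fitness(mse))  # values: 1, 0.5, 0.1, 0.01
```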
Next, we need to split our data into the training, validation, and testing sets. We will train our algorithms using the training dataset and use the validation set to compare our models and to prevent overfitting via early stopping. Lastly, after we’ve chosen our final model, we will evaluate its accuracy on the test dataset.
We will pass our training and validation data to our evolution algorithm. As stated before, we will test different window sizes, so we will loop over all possible window sizes, create the data, and test our algorithm:
import numpy as np
import pandas as pd
df = pd.read_csv("Sunspots.csv")
y = np.asarray(df['Monthly Mean Total Sunspot Number'])
size = len(y)
# 50% of data for training
train_ind = int(size * 0.50)
# 25% of data for validation and other 25% for testing
val_ind = int(size * 0.75)
max_window = 10
min_window = 3
initial_population_size = 100 # 100 neural networks
best_models = [] # best model from each run of the algorithm per window size
best_fits = []
# randomly shuffle data through indices:
shuffled_indices = np.asarray(range(0, size-max_window))
np.random.shuffle(shuffled_indices)
# loop over each window size
for vision in range(min_window, max_window + 1):
input = []
output = []
# creates the window of length 'vision' for each value;
# because the first few values do not have a full window
# behind them, we skip them; that's why we start
# at 'vision' and not 0
for j in range(vision, size):
input.append(y[(j - vision):j].tolist())
output.append(y[j])
input = np.asarray(input)
output = np.asarray(output)
temp = np.column_stack((output, input))
# instead of shuffle each time here, we shuffle once outside loop
# so that all window sizes have the same final array
temp = temp[shuffled_indices]
output = temp[:, 0]
input = temp[:, 1:]
y_train = output[0:train_ind]
y_val = output[train_ind:val_ind]
y_test = output[val_ind:size]
x_train = input[0:train_ind]
x_val = input[train_ind:val_ind]
x_test = input[val_ind:size]
init_gen = []
for i in range(0, initial_population_size):
init_gen.append(EvolvableNetwork(layer_nodes=[5, 5, 5], num_input=vision, num_output=1, initialize=True))
best_model = evolve(init_gen, const_mutate=0.1, max_iter=200, train_data=[x_train, y_train], val_data=[x_val, y_val])
best_models.append(best_model)
best_fits.append(fitness_function([best_model], [x_val, y_val]))
It is time to define the body of the evolutionary algorithm:
# const_mutate in our example is the actual max value to mutate by, not percentage
# use train and val data to prevent overfitting - we early stop if the mean of
# validation data increases for three straight iterations
def evolve(init_gen, const_mutate, max_iter, train_data, val_data):
    gen = init_gen
    mean_fitness = []
    val_mean = []  # validation mean value
    best_fitness = []
    prev_val = 1000
    n = len(gen)
    val_index = 0
    for k in range(0, max_iter):
        fitness = fitness_function(gen, train_data)
        # scale so that large values -> small
        # and small values -> large
        scaled_fit = scale_fitness(fitness)
        # create distribution for proportional selection
        fit_sum = np.sum(scaled_fit)
        fit = scaled_fit / fit_sum
        cumulative_sum = np.cumsum(fit)
        selected = roulette_wheel_selection(cumulative_sum, n)
        mates = roulette_wheel_selection(cumulative_sum, n)
        children = []
        for i in range(0, n):
            children.append(reproduce(gen[selected[i]], gen[mates[i]], const_mutate, fitness_function, train_data))
        gen_next = children
        # evaluate training data
        fit = fitness_function(gen_next, train_data)
        fit_mean = np.mean(fit)
        fit_best = np.min(fit)
        mean_fitness.append(fit_mean)
        best_fitness.append(fit_best)
        # evaluate validation data
        val_fit = fitness_function(gen_next, val_data)
        val_fit_mean = np.mean(val_fit)
        val_mean.append(val_fit_mean)
        print("Generation: " + str(k))
        print(" Best: {}, Avg: {}".format(fit_best, fit_mean))
        print(" Validation: {}".format(val_fit_mean))
        gen = gen_next
        # check if previous iteration validation increased or decreased
        if val_fit_mean > prev_val:
            val_index += 1
        else:
            val_index = 0
        if val_index == 3:  # val has increased for three straight iterations
            print("Over Fitting, Stopping...")
            break
        prev_val = val_fit_mean
    # use validation data to choose best model from current generation
    val_fit = fitness_function(gen_next, val_data)
    best_val = np.min(val_fit)
    best_ind = np.argmin(val_fit)
    print("Best Model: ")
    print(" Validation: {}".format(best_val))
    return gen_next[best_ind]
For evaluation, we find the model with the best validation score across all window sizes, then recreate the data based on that window size to compute the test MSE.
# get best model
best_index = np.argmin(best_fits)
best_model = best_models[best_index]
# recreate data with that window size
vision = best_index + min_window
input = []
output = []
for j in range(vision, size):
    input.append(y[(j - vision):j].tolist())
    output.append(y[j])
input = np.asarray(input)
output = np.asarray(output)
temp = np.column_stack((output, input))
temp = temp[shuffled_indices]
output_2 = temp[:, 0]
input_2 = temp[:, 1:]
y_test = output_2[val_ind:size]
x_test = input_2[val_ind:size]
# evaluate test data
mse_test = fitness_function([best_model], [x_test, y_test])
print("\nBest Validation Fitness Values Per Window Size:")
for index, fit in enumerate(best_fits):
    print("Window Size: {} - Validation MSE: {}".format(index + min_window, fit[0]))
print("Best Model: \n"
      " Window Size : {}\n"
      " MSE for Test Data Set : {}".format(best_index + min_window, mse_test[0]))
# printing best model predictions
xaxis = range(vision, len(y))
plt.plot(xaxis, y[vision:], label="Actual")
plt.plot(xaxis, best_model.predict(input), label="Prediction")
plt.xlabel("Months")
plt.ylabel("Mean Total Sunspot Number")
plt.title("Sunspot Cycle From 1749-2021")
plt.legend()
plt.show()
Differential Evolution differs from standard genetic algorithms in that mutation is the primary search operator: new candidates are built from scaled difference vectors between existing population members rather than from random perturbations.
The general flow of the DE algorithm can be depicted as follows:
Compare it to the flow of the general GA:
The mutation operator in Differential Evolution - the unit vector - is created from two components: a target vector and a scaled difference vector between two selected parents.
An exact definition:
$$u_i = x_{i1} + \beta(x_{i2} - x_{i3})$$
where $i \neq i_1 \neq i_2 \neq i_3$, $u_i$ is the unit vector, $x_{i1}$ is the target vector, $x_{i2}$ and $x_{i3}$ are the selected parents, and $\beta$ is a scaling coefficient.
We can see the geometric power of the unit vector below:
In the figure, there are three parent vectors: $X_1$, $X_2$, and $X_3$. We create the difference vector between $X_2$ and $X_3$ and scale it by $\beta$. We then add the scaled difference to the target vector $X_1$ to obtain the final unit vector $U$.
The unit vector, the final product of the mutation process, is passed along to be crossed over with the current ‘major’ parent, the parent of interest not included in the mutation operation.
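As a concrete illustration, the mutation step can be sketched in a few lines of NumPy (a minimal sketch; the function name, the example vectors, and $\beta=0.5$ are my own choices, not part of the course code):

```python
import numpy as np

def de_mutate(x_target, x_p2, x_p3, beta=0.5):
    """Unit vector: target vector plus the scaled difference of two parents."""
    return x_target + beta * (x_p2 - x_p3)

# three parent vectors: X1 (target), X2 and X3 (difference-vector parents)
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 1.0])
x3 = np.array([2.0, 4.0])

u = de_mutate(x1, x2, x3, beta=0.5)  # X1 + 0.5 * (X2 - X3) -> [1.5, 0.5]
```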
As in Genetic Algorithms, there are two types of crossover techniques in Differential Evolution: average and intuitive.
After creating the unit vector, we select the 'major' parent (different from the parents used before), and then perform crossover between this new parent and the unit vector to obtain the final offspring.
Example:
In the example, we can see how the coordinates of the unit vector $U(x_u,y_u)$ and the chosen 'major' parent $P(x_p,y_p)$ are combined to produce the offspring: child 1, $C_1(x_p,y_u)$, and child 2, $C_2(x_u,y_p)$.
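Both techniques can be sketched for 2-D vectors as follows (a minimal sketch; the function names and example values are mine):

```python
import numpy as np

def intuitive_crossover(u, p):
    """Swap coordinates between the unit vector U and the 'major' parent P."""
    child1 = np.array([p[0], u[1]])  # C1(x_p, y_u)
    child2 = np.array([u[0], p[1]])  # C2(x_u, y_p)
    return child1, child2

def average_crossover(u, p):
    """Component-wise average of the unit vector and the 'major' parent."""
    return (u + p) / 2.0

u = np.array([1.5, 0.5])  # unit vector U(x_u, y_u)
p = np.array([4.0, 3.0])  # 'major' parent P(x_p, y_p)

c1, c2 = intuitive_crossover(u, p)  # -> [4.0, 0.5], [1.5, 3.0]
avg = average_crossover(u, p)       # -> [2.75, 1.75]
```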
There are many different strategies for creating unit vectors, known as the DE/x/y/z Strategies, where x denotes the chosen target vector, y denotes the number of difference vectors, and z represents the crossover method.
Here are the most common strategies:
DE/Best/1/z: This method uses only one difference vector, where the parents are randomly selected, and whose target vector is the best individual in the current population: $$u_i=\hat{x}+\beta(x_{i2}-x_{i3})$$
DE/Rand/n/z: This method works by selecting a random target vector and n randomly selected difference vectors: $$u_i=x_{i1}+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$
DE/Rand-to-Best/n/z: This method incorporates a linear combination of the best individual and a random parent for the target vector, plus n randomly selected difference vectors: $$u_i=\gamma\hat{x}+(1-\gamma)x_{i1}+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$
DE/Current-to-Best/n/z: This method incorporates the 'major' parent along with the best individual for the target vector, plus n randomly selected difference vectors: $$u_i=x_i+\beta(\hat{x}-x_i)+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$
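The four formulas above translate almost directly into NumPy (a minimal sketch; `best` stands for $\hat{x}$, `pairs` is a list of $(x_{i2,k}, x_{i3,k})$ parent pairs, and all names are mine):

```python
import numpy as np

def de_best_1(best, x2, x3, beta):
    # DE/Best/1/z: best individual as target, one difference vector
    return best + beta * (x2 - x3)

def de_rand_n(x1, pairs, beta):
    # DE/Rand/n/z: random target, sum of n difference vectors
    return x1 + beta * sum(a - b for a, b in pairs)

def de_rand_to_best_n(best, x1, pairs, beta, gamma):
    # DE/Rand-to-Best/n/z: linear combination of best and a random target
    return gamma * best + (1 - gamma) * x1 + beta * sum(a - b for a, b in pairs)

def de_current_to_best_n(x_i, best, pairs, beta):
    # DE/Current-to-Best/n/z: 'major' parent pulled toward the best individual
    return x_i + beta * (best - x_i) + beta * sum(a - b for a, b in pairs)
```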
Using the best individual as the target vector speeds up convergence but risks converging prematurely to a local optimum, while using random target vectors preserves diversity at the cost of slower convergence.
A compromise between the two can be made by the use of Combined Strategies.
We can choose between the following parameter strategies:
The major parameter in DE is the scaling coefficient $\beta$, which usually takes values in $[0,\infty)$:
It is standard to set $\beta$ around 0.5 when using one difference vector and to decrease it in proportion to the number of difference vectors.
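One way to operationalize that rule of thumb (my own formulation, not a formula prescribed by the course):

```python
def scaled_beta(num_difference_vectors, base=0.5):
    """Start near 0.5 for one difference vector, shrink as more are added."""
    return base / num_difference_vectors
```

So with two difference vectors, $\beta$ would drop to 0.25.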
In this use case, the architecture of a Convolutional Neural Network is evolved for classifying images from the CIFAR-10 dataset.
Dataset: 60,000 small 32x32 RGB images from CIFAR-10, spanning 10 possible categories.
Model: a Convolutional Neural Network for categorizing the images in the dataset.
Automated Machine Learning is a field of Optimization and Machine Learning theory whose goal is to find the best set of hyperparameters for a given algorithm.
Why use DE for Automated Machine Learning?
A critical assumption is needed: that the input space of our architecture has only one or a few local optima. If, on the other hand, you have multiple processors and GPUs that can spread out the computation, then using a standard genetic algorithm would be beneficial.
Another alternative is Evolutionary Programming, which sidesteps the main problem of crossover - the Permutation Problem, where the offspring is extremely poor despite both parents being very good. Evolutionary Programming uses no crossover, only mutation, which allows for speciation.
The CNN will be split into two sections, a convolution network and a deep network, each composed of its respective modules.
The algorithm will have the ability to
The possible choices per layer in each individual convolution module:
The table can be read as follows:
etc.
Again, the algorithm will have the ability to
Here are the possible choices per layer in each individual module:
The permutation problem, otherwise known as the competing conventions problem, specifically refers to the problem of performing crossover with two or more parents to obtain offspring.
The permutation problem arises when the phenotypes of the network architectures between the parents are so different that much of the genetic information is lost during the crossover process, often producing an offspring that is extremely poor.
For example:
In this example, the weights, the number of layers, the number of nodes, and the activation functions per layer are evolved for a population of networks.
The problem is that the architectures between the parents are so different that when we perform crossover, we get a hybrid offspring with a brand new architecture.
The general solutions for the permutation problem are:
The solution for the current problem is to use two identical CNN architectures that differ only in their optimizer or activation function. For simplicity, there will be no evolving or speciation of optimizers and activation functions; these will remain static.
Dataset:
Training:
In this way, training time can be reduced many times over while the power of the evolutionary search process is preserved.
Algorithm:
The goal of the algorithm is to find the best possible CNN for classifying the CIFAR-10 images into 10 classes.
Initial population: a population of 20 CNNs with similar architectures.
Selection for Multi-Parent Reproduction: one parent (CNN) is the target vector, two parents are used for the difference vector, and the final parent is the 'major' parent, which has the best fitness.
Mutation (Creating the Unit Vector): for every individual (CNN) in the generation, the unit vector is created using the DE/Rand/1/z strategy with a single difference vector: $$u_i=x_{i1}+\beta(x_{i2}-x_{i3})$$
Crossover of 'Major' Parent with Unit Vector to Create Offspring: the 'major' parent is crossed over with the unit vector using the averaging technique to obtain the offspring (a new CNN).
Select the Better of Child and Parent: if the child's architecture (CNN) yields better validation accuracy than the 'major' parent's, the parent is replaced. Validation accuracy is computed by the fitness function, which trains and validates each individual (CNN) in the generation.
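Putting the steps together, one generation of this scheme can be sketched on plain weight vectors (a minimal sketch: `fitness` is a stand-in for the train-and-validate fitness function, lower is taken as better, and all names are mine; real CNNs would replace the toy vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def de_generation(pop, fitness, beta=0.5):
    """One DE generation: mutate, average-crossover, keep the better of child/parent."""
    n = len(pop)
    new_pop = []
    for i in range(n):
        # pick three distinct parents, all different from the 'major' parent i
        i1, i2, i3 = rng.choice([j for j in range(n) if j != i], size=3, replace=False)
        u = pop[i1] + beta * (pop[i2] - pop[i3])  # unit vector (DE/Rand/1)
        child = (pop[i] + u) / 2.0                # averaging crossover with 'major' parent
        # replace the parent only if the child is fitter
        new_pop.append(child if fitness(child) < fitness(pop[i]) else pop[i])
    return new_pop

# toy run: minimize the 2-D sphere function with a population of 20
sphere = lambda x: float(np.sum(x ** 2))
pop = [rng.normal(size=2) for _ in range(20)]
for _ in range(50):
    pop = de_generation(pop, sphere)
best = min(pop, key=sphere)
```

Because a parent is only ever replaced by a fitter child, the best fitness in the population can never get worse from one generation to the next.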